multidimensional data
Shape Proportions and Sphericity in n Dimensions
Shape metrics for objects in high dimensions remain sparse. Those that do exist, such as hyper-volume, are limited to well-understood objects such as Platonic solids and $n$-cubes. Further, understanding ill-defined shapes in higher dimensions is ambiguous at best. Past work does not provide a single number that gives a qualitative understanding of an object. For example, principal component analysis yields $n$ eigenvalues, and therefore $n$ metrics, to describe the shape of an object. We therefore need a single number that can discriminate objects of different shapes from one another. Previous work has developed shape metrics for specific dimensions, such as two or three dimensions. However, there is an opportunity to develop metrics for any desired dimension. To that end, we present two new shape metrics for objects in a given number of dimensions: hyper-Sphericity and hyper-Shape Proportion (SP). We explore the properties of these metrics on a number of different shapes, including $n$-balls. We then connect these metrics to applications in analyzing the shape of multidimensional data, such as the popular Iris dataset.
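As background for the hyper-volume mentioned above, the following minimal sketch computes the standard closed-form volumes of an $n$-ball and an $n$-cube. It is not the hyper-Sphericity or hyper-Shape Proportion metric, which the abstract does not define; the function names are hypothetical and chosen only for illustration.

```python
import math


def n_ball_volume(n: int, radius: float = 1.0) -> float:
    """Volume of an n-ball: pi^(n/2) / Gamma(n/2 + 1) * r^n."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * radius ** n


def n_cube_volume(n: int, side: float = 1.0) -> float:
    """Volume of an n-cube with the given side length."""
    return side ** n


# Fraction of the enclosing cube (side 2) filled by the unit ball, per dimension.
for n in range(1, 7):
    ratio = n_ball_volume(n) / n_cube_volume(n, side=2.0)
    print(n, round(ratio, 4))
```

The printed ratio shrinks as $n$ grows, which illustrates why intuition about shape from two or three dimensions does not carry over directly to higher dimensions.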
Compressing (Multidimensional) Learned Bloom Filters
Davitkova, Angjela, Gjurovski, Damjan, Michel, Sebastian
Bloom filters are widely used data structures that compactly represent sets of elements. Querying a Bloom filter reveals if an element is not included in the underlying set or is included with a certain error rate. This membership testing can be modeled as a binary classification problem and solved through deep learning models, leading to what is called learned Bloom filters. We have identified that the benefits of learned Bloom filters are apparent only when considering a vast amount of data, and even then, there is a possibility to further reduce their memory consumption. For that reason, we introduce a lossless input compression technique that improves the memory consumption of the learned model while preserving a comparable model accuracy. We evaluate our approach and show significant memory consumption improvements over learned Bloom filters.
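The membership semantics described above (a definite "no", or a "yes" with some false-positive rate) can be sketched with a classical, non-learned Bloom filter. This is an illustrative sketch only, not the learned or compressed variant from the paper; the class name, the bit-array size, and the use of salted SHA-256 as the hash family are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash positions per item over an m-bit array."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index
        # (an illustrative choice, not a tuned hash family).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in the set"; True means "possibly in the set".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=1024, num_hashes=3)
for word in ["iris", "hubble", "glioblastoma"]:
    bf.add(word)

print(bf.might_contain("iris"))          # inserted items are always reported as present
print(bf.might_contain("not-inserted"))  # usually False; True would be a false positive
```

A learned Bloom filter replaces part of this structure with a classifier that scores membership; the sketch only shows the baseline behaviour that the classifier is meant to approximate.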
Interpretable Knowledge Discovery Reinforced by Visual Methods
Editor's Note: See Boris Kovalerchuk's talk "Interpretable Knowledge Discovery Reinforced by Visual Methods" at ODSC West 2019. Visual reasoning and discovery have a long history. The Chinese and Indians had a visual proof of the Pythagorean Theorem in 600 B.C., before it was known to the Greeks. Scientists such as Bohr, Boltzmann, Einstein, Faraday, Feynman, Heisenberg, Helmholtz, Herschel, Kekule, Maxwell, Poincare, Tesla, Watson, and Watt have declared the fundamental role that images played in their most creative thinking. The fundamental challenge for visual creative thinking and discovery in the multidimensional data (n-D data) used in machine learning (ML) is that we cannot see multidimensional data with the naked eye.
Distributed Convolutional Dictionary Learning (DiCoDiLe): Pattern Discovery in Large Images and Signals
Moreau, Thomas, Gramfort, Alexandre
Convolutional dictionary learning (CDL) estimates a shift-invariant basis adapted to multidimensional data. CDL has proven useful for image denoising and inpainting, as well as for pattern discovery in multivariate signals. Because estimated patterns can be positioned anywhere in signals or images, optimization techniques face the difficulty of working in extremely high dimensions, with millions of pixels or time samples, in contrast to standard patch-based dictionary learning. To address this optimization problem, this work proposes a distributed and asynchronous algorithm that employs locally greedy coordinate descent and an asynchronous locking mechanism that does not require a central server. This algorithm can be used to distribute the computation over a number of workers that scales linearly with the encoded signal's size. Experiments confirm the scaling properties, which allow us to learn patterns on large-scale images from the Hubble Space Telescope.
Scope of Research on Particle Swarm Optimization Based Data Clustering
Metre, Vishakha A, Deshmukh, Mr Pramod B
Optimization is a mathematical technique that finds the maxima or minima of a function of interest over some feasible region. Many optimization techniques have been proposed, each competing to find the best solution. Particle Swarm Optimization (PSO) is a recent and powerful optimization methodology that performs well empirically on several optimization problems. It is a widely used Swarm Intelligence (SI) inspired optimization algorithm for finding the globally optimal solution in a multifaceted search region. Data clustering is one of the challenging real-world applications that attracts prominent research in a variety of fields. The applicability of different PSO variants to data clustering has been studied in the literature, and the analyzed research shows that PSO variants give poor results for multidimensional data. This paper describes the challenges associated with multidimensional data clustering and the scope of research on optimizing clustering problems using PSO. We also propose a strategy that uses a hybrid PSO variant for clustering multidimensional numerical, text, and image data.
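For readers unfamiliar with PSO, the following is a minimal sketch of the textbook global-best variant applied to a simple test function. It is not the hybrid variant proposed in the paper, and it does not perform clustering; the function name `pso_minimize` and the parameter values (inertia 0.7, acceleration coefficients 1.5) are common illustrative choices assumed here.

```python
import random


def pso_minimize(objective, dim, bounds, num_particles=30, iterations=200,
                 inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Global-best PSO over a box [lo, hi]^dim; returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    positions = [[rng.uniform(lo, hi) for _ in range(dim)]
                 for _ in range(num_particles)]
    velocities = [[0.0] * dim for _ in range(num_particles)]
    personal_best = [p[:] for p in positions]
    personal_best_val = [objective(p) for p in positions]
    g = min(range(num_particles), key=personal_best_val.__getitem__)
    global_best, global_best_val = personal_best[g][:], personal_best_val[g]

    for _ in range(iterations):
        for i in range(num_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own best,
                # and pull toward the swarm's best position.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + cognitive * r1 * (personal_best[i][d] - positions[i][d])
                    + social * r2 * (global_best[d] - positions[i][d])
                )
                # Move and clamp to the search bounds.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            value = objective(positions[i])
            if value < personal_best_val[i]:
                personal_best[i], personal_best_val[i] = positions[i][:], value
                if value < global_best_val:
                    global_best, global_best_val = positions[i][:], value
    return global_best, global_best_val


def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_pos, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(best_pos, best_val)
```

To use a swarm like this for clustering, one common formulation (again an assumption, not taken from the paper) encodes a full set of cluster centroids in each particle and uses the within-cluster distance as the objective, so the search dimension grows with both the number of clusters and the data dimension.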
Diffusion Representations
Salhov, Moshe, Bermanis, Amit, Wolf, Guy, Averbuch, Amir
The Diffusion Maps framework is a kernel-based method for manifold learning and data analysis that defines diffusion similarities by imposing a Markovian process on the given dataset. Analysis by this process uncovers the intrinsic geometric structures in the data. Recently, it was suggested to replace the standard kernel by a measure-based kernel that incorporates information about the density of the data. Thus, the manifold assumption is replaced by a more general measure-based assumption. The measure-based diffusion kernel incorporates two separate, independent representations. The first determines a measure that correlates with a density that represents normal behaviors and patterns in the data. The second consists of the analyzed multidimensional data points. In this paper, we present a representation framework for data analysis that is based on a closed-form decomposition of the measure-based kernel. The proposed representation preserves pairwise diffusion distances, does not depend on the data size, and is invariant to scale. For stationary data, no out-of-sample extension is needed for embedding newly arrived data points in the representation space. Several aspects of the presented methodology are demonstrated on analytically generated data.
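For reference, the pairwise diffusion distance mentioned above can be written out for the standard (non-measure-based) Diffusion Maps construction. This is a sketch of the classical Gaussian-kernel formulation, not the measure-based kernel or the closed-form decomposition of the paper; the symbols $\varepsilon$ (kernel scale), $t$ (diffusion time), and $\pi$ (stationary distribution) are notation assumed here.

```latex
% Gaussian affinity, degrees, and the row-stochastic Markov matrix
\[
K_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{\varepsilon}\right),
\qquad
d_i = \sum_j K_{ij},
\qquad
P_{ij} = \frac{K_{ij}}{d_i}
\]

% Diffusion distance at time t, weighted by the stationary distribution
\[
D_t^2(x_i, x_j) = \sum_k \frac{\left( [P^t]_{ik} - [P^t]_{jk} \right)^2}{\pi_k},
\qquad
\pi_k = \frac{d_k}{\sum_m d_m}
\]

% Equivalent spectral form, using eigenpairs (\lambda_l, \psi_l) of P
\[
D_t^2(x_i, x_j) = \sum_{l \ge 1} \lambda_l^{2t} \left( \psi_l(x_i) - \psi_l(x_j) \right)^2
\]
```

The spectral form assumes the right eigenvectors $\psi_l$ of $P$ are normalized so that $\sum_k \psi_l(x_k)^2 \pi_k = 1$, with $\psi_0$ the constant eigenvector (eigenvalue 1) omitted from the sum. Embedding each point as $\left(\lambda_l^t \psi_l(x_i)\right)_{l \ge 1}$ therefore preserves diffusion distances as Euclidean distances.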
Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data
Savage, Richard S., Ghahramani, Zoubin, Griffin, Jim E., Kirk, Paul, Wild, David L.
We present a nonparametric Bayesian method for disease subtype discovery in multi-dimensional cancer data. Our method can simultaneously analyse a wide range of data types, allowing for both agreement and disagreement between their underlying clustering structure. It includes feature selection and infers the most likely number of disease subtypes, given the data. We apply the method to 277 glioblastoma samples from The Cancer Genome Atlas, for which there are gene expression, copy number variation, methylation and microRNA data. We identify 8 distinct consensus subtypes and study their prognostic value for death, new tumour events, progression and recurrence. The consensus subtypes are prognostic of tumour recurrence (log-rank p-value of $3.6 \times 10^{-4}$ after correction for multiple hypothesis tests). This is driven principally by the methylation data (log-rank p-value of $2.0 \times 10^{-3}$), but the effect is strengthened by the other 3 data types, demonstrating the value of integrating multiple data types. Of particular note is a subtype of 47 patients characterised by very low levels of methylation. This subtype has very low rates of tumour recurrence and no new events in 10 years of follow-up. We also identify a small gene expression subtype of 6 patients that shows particularly poor survival outcomes. Additionally, we note a consensus subtype that shows a highly distinctive data signature and suggest that it is therefore a biologically distinct subtype of glioblastoma. The code is available from https://sites.google.com/site/multipledatafusion/
A Variational Principle for Model-based Morphing
Saul, Lawrence K., Jordan, Michael I.
Given a multidimensional data set and a model of its density, we consider how to define the optimal interpolation between two points. This is done by assigning a cost to each path through space, based on two competing goals: one to interpolate through regions of high density, the other to minimize arc length. From this path functional, we derive the Euler-Lagrange equations for extremal motion; given two points, the desired interpolation is found by solving a boundary value problem. We show that this interpolation can be done efficiently, in high dimensions, for Gaussian, Dirichlet, and mixture models.
1 Introduction
The problem of nonlinear interpolation arises frequently in image, speech, and signal processing. Consider the following two examples: (i) given two profiles of the same face, connect them by a smooth animation of intermediate poses [1]; (ii) given a telephone signal masked by intermittent noise, fill in the missing speech. Both these examples may be viewed as instances of the same abstract problem. In qualitative terms, we can state the problem as follows [2]: given a multidimensional data set, and two points from this set, find a smooth adjoining path that is consistent with available models of the data. We will refer to this as the problem of model-based morphing. In this paper, we examine this problem as it arises from statistical models of multidimensional data. Specifically, our focus is on models that have been derived from ... (Current address: AT&T Labs, 600 Mountain Ave 2D-439, Murray Hill, NJ 07974.)